Posted on 09/07/2017 at 5:00 PM by The Vibe.
The second month got off to a great start with 3 Pull Requests. In this blog post, I'd like to document this week's progress. Despite also working on a fun Ruby project called colorls, I'm glad that I was able to accommodate this week's GSoC timeline: XLSX Importer, CSV Importer and CSV Exporter.
The XLSX Importer imports a Daru::DataFrame from *.xlsx files. This was previously not possible with the from_excel method, as the spreadsheet gem doesn't support parsing xlsx files. The roo gem is used to parse the data from the *.xlsx file.
module Daru
  module IO
    module Importers
      class XLSX < Base
        def call
          book      = Roo::Excelx.new(@path)
          worksheet = book.sheet(@sheet)
          data      = worksheet.to_a
          data      = strip_html_tags(data)
          order     = @headers ? data.delete_at(0) : (0..data.first.length-1).to_a
          Daru::DataFrame.rows(data, order: order)
        end
      end
    end
  end
end
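The header handling in the call method can be tried out without roo or daru. The snippet below is a minimal sketch on plain arrays; split_headers is a hypothetical helper name used only for illustration and is not part of daru-io.

```ruby
# Hypothetical helper (not part of daru-io): mirrors the `order` logic
# of the importer on plain arrays.
def split_headers(rows, headers: true)
  rows = rows.map(&:dup) # work on a copy so the caller's data is untouched
  # With headers, the first row becomes the column order;
  # otherwise the columns are numbered 0, 1, 2, ...
  order = headers ? rows.shift : (0...rows.first.length).to_a
  [order, rows]
end

rows = [%w[id name], [1, 'alice'], [2, 'bob']]

# First row is treated as the header row.
order, data = split_headers(rows, headers: true)

# No header row: every row is data and the columns get integer names.
order2, data2 = split_headers(rows, headers: false)
```

In the first call, order holds the two column names and data holds the remaining two rows. In the second call, order2 is a list of integer column indices and data2 keeps all three rows.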
By default, the roo gem transforms all the parsed data into HTML-tagged data. However, HTML-tagged data is not suitable for analysis, so the HTML tags have to be stripped from the parsed data. This is done by the private method strip_html_tags.
module Daru
  module IO
    module Importers
      class XLSX < Base
        private

        def strip_html_tags(data)
          data.map do |row|
            row.map do |ele|
              # Non-String cells (numbers, nil, dates) are passed through as-is
              next ele unless ele.is_a?(String)
              ele.gsub(/<[^>]+>/, '')
            end
          end
        end
      end
    end
  end
end
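The tag stripping can be checked on a small 2D Array without reading any spreadsheet. This is a standalone sketch that uses the same regular expression as the method above; the sample rows are made up for illustration.

```ruby
# Standalone copy of the stripping logic, applied to plain arrays.
def strip_html_tags(data)
  data.map do |row|
    # Only String cells are touched; other values are returned unchanged.
    row.map { |ele| ele.is_a?(String) ? ele.gsub(/<[^>]+>/, '') : ele }
  end
end

# Made-up sample rows mixing tagged Strings with non-String cells.
rows    = [['<html><b>Name</b></html>', 42], ['<i>Ruby</i>', nil]]
cleaned = strip_html_tags(rows)
# cleaned is [["Name", 42], ["Ruby", nil]]
```

Note that the pattern removes anything between a `<` and the next `>`, so a cell whose text legitimately contains both characters would also lose that part.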
Progress related to this can be tracked from this Pull Request and this Issue.
This Pull Request adds the following improvements to the CSV Importer module:
- Support for the .csv.gz format with the compression: :gzip option (as per issue #127). The zlib gem is used to read the compressed .csv.gz file as a String, which is then converted to an Array of rows with CSV.parse().
- A :skiprows option to skip the first few rows (as per issue #220). The basic logic for skipping rows is to crop the 2D Array obtained from the Importer with arr_of_arr[@skiprows..-1].
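The two steps described above (reading gzip-compressed CSV text and then skipping leading rows) can be sketched with Ruby's standard library alone. The in-memory data and the skiprows variable are assumptions made for illustration; the actual importer reads from a file path and stores the option in @skiprows.

```ruby
require 'csv'
require 'stringio'
require 'zlib'

# Build a small gzip-compressed CSV in memory (stands in for a .csv.gz file).
raw     = "generated by a hypothetical tool\nid,name\n1,alice\n2,bob\n"
gzipped = Zlib.gzip(raw)

# Decompress into a String, then parse it into an Array of Arrays.
csv_text   = Zlib::GzipReader.new(StringIO.new(gzipped)).read
arr_of_arr = CSV.parse(csv_text)

# Crop the leading row(s), as the :skiprows option does.
skiprows = 1
rows     = arr_of_arr[skiprows..-1]

# The first remaining row is the header row; the rest is data.
headers = rows.shift
```

Here the first line of the input is a comment line rather than data, so skipping one row leaves the header row followed by the two data rows.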
This Pull Request adds the option to write a .csv.gz format with the compression: :gzip option (as per issue #127). The zlib gem is used to write the Daru::DataFrame into a compressed .csv.gz file.
Converting the Daru::DataFrame into a writable Array of Arrays is handled by the process_dataframe() method, as this step is common to both output formats. The writer is then chosen depending on the :compression option given.
module Daru
  module IO
    module Exporters
      class CSV < Base
        def call
          contents = process_dataframe

          if @compression == :gzip
            # The block form of GzipWriter.open closes the file automatically
            ::Zlib::GzipWriter.open(@path) do |gz|
              contents.each { |content| gz.write("#{content.join(',')}\n") }
            end
          else
            csv = ::CSV.open(@path, 'w', @options)
            contents.each { |content| csv << content }
            csv.close
          end
        end

        private

        def process_dataframe
          # Header row, unless headers were explicitly disabled
          order = [@dataframe.vectors.to_a] unless @headers == false
          data  = @dataframe.map_rows do |row|
            next row.to_a unless @convert_comma
            # Write decimal points as commas when :convert_comma is set
            row.map { |v| v.to_s.tr('.', ',') }
          end
          return data if order.nil?
          return order if data.empty?
          order + data
        end
      end
    end
  end
end
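The gzip branch can be exercised on its own with a round trip through a temporary file. This is a sketch under a few assumptions: the contents array is made-up data standing in for the output of process_dataframe, and rows are serialised with Array#to_csv rather than the join(',') used above, so that a field containing a comma is quoted and survives the round trip.

```ruby
require 'csv'
require 'tempfile'
require 'zlib'

# Made-up rows standing in for the Array of Arrays built from a DataFrame.
contents = [%w[city temp], ['Pune', 31.5], ['San Jose, CA', 18.2]]

# Reserve a temporary path ending in .csv.gz, then release the handle.
file = Tempfile.new(['sketch', '.csv.gz'])
file.close
path = file.path

# Write each row as one CSV line into the gzip stream.
# The block form closes the stream when the block returns.
Zlib::GzipWriter.open(path) do |gz|
  contents.each { |row| gz.write(row.to_csv) }
end

# Read the file back and parse it to confirm the contents.
round_trip = Zlib::GzipReader.open(path) { |gz| CSV.parse(gz.read) }
file.unlink
```

After the round trip every value comes back as a String, since CSV carries no type information, and the "San Jose, CA" field is still a single cell.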